Abstract

Feedforward neural networks with error backpropagation (FFBP) are widely applied to
pattern recognition. One general problem encountered with this type of neural network is
the uncertainty as to whether the minimization procedure has converged to a global minimum
of the cost function. To overcome this problem, a novel approach to minimizing the error
function is presented. It allows the approach to the global minimum to be monitored and, as an
outcome, several ambiguities related to the choice of free parameters of the minimization
procedure are removed.
1 Introduction



In high energy physics the separation of signal from background usually turns out to be a
multi-dimensional classification problem with many variables involved in order to achieve a reasonable
rejection factor. This is the domain where neural networks, with their intrinsic ability to deal
with many dimensions, are worth applying. The output values of neural networks (NN)
can be interpreted as estimators of a posteriori Bayesian probabilities, which provide the link
to classification problems of higher order [?].

A neural network can be regarded as a non-linear combination of several transformation
matrices, with entries (denoted as weights) adjusted in the training phase by a least squares
minimization of an error function. There are several technical problems associated with the
training of the neural network. Real world applications rarely allow a perfect separation of
patterns, i.e. given a problem with patterns of two classes $C_1$ and $C_2$, a certain fraction of
patterns belonging to class $C_1$ will look like patterns from class $C_2$ and vice versa. This
effect may be denoted as the confusing teacher problem. In high energy physics one deals with
overlapping distributions which cause an a priori contamination, i.e. indistinguishable
patterns assigned to different classes. The non-linearity of the problem and the number of free
parameters involved increase the likelihood that the minimization procedure converges to a
local minimum of the error function. This leads to a deterioration of the separation ability and
therefore a poorer estimate of the Bayesian probabilities, which are lower limits to the
probability of error in any classification problem. The type of network used in this analysis is
known as a feedforward neural network with error backpropagation (FFBP) [?]. The name
originates from the specific architecture of the transformation matrices and the method applied to
optimize their entries. A pattern is represented by an input vector whose entries are processed
by the network in several layers of units. Each unit feeds only the units of the following layer
(feedforward). During the minimization procedure the calculated difference between the actual
network output and the desired output is used to adjust the weights (error backpropagation) [?].

A general prescription for avoiding problems with local minima in the minimization procedure
for feedforward neural networks with a quadratic cost function is presented in this article.
The basic features of the new model are demonstrated on a one-dimensional problem.



2 Mathematical foundation of feedforward neural networks

Each pattern to be classified is represented by a vector $X$ of dimension $K$ (called the input
vector). For the purpose of estimating the Bayesian a posteriori probabilities the input vector
$X$ is projected into the output space by means of the neural network formula

$$ Y = f(X) , \tag{1} $$

in which $Y$ is of dimension $I$, equal to the total number of possible classes $C_l$, with $l \in [1, I]$.
Standard FFBP use for $f$

with $i \in [1, I]$, $j \in [1, J]$, $k \in [1, K]$ and where the Einstein convention for repeated indices was
used. For two overlapping Gaussian distributions the function $g$ represents the exact solution
for the Bayesian probabilities, which is the motivation for the choice of this particular sigmoid
function [?]. The weights $A_{ij}$, $B_{jk}$, $\alpha_j$ and $\beta_i$ are the free parameters of the fit function $f$, and
$J$ denotes the number of hidden nodes. In analogy to spin Ising models $t$ is called the temperature
and is usually set to one. For any classification problem the aim is to achieve, for an input
vector $X$ belonging to class $C_l$, the output value $Y = O(C_l)$, where $O(C_l)$ is a unit vector
whose components are all zero except for the entry with index $l$. For any $A_{ij}$, $B_{jk}$, $\alpha_j$ and $\beta_i$,
initialized at random in an interval $[-\epsilon, +\epsilon]$, one defines the error function

where the first sum runs over all $N$ patterns available and the second sum runs over all $I$ possible
classes. This error function is minimized iteratively by a gradient descent method

where $\eta$ is the step size ($\eta \in (0, \infty)$) and $0 < \kappa < 1$ is the weight of the momentum term.
The same procedure applies to the matrices $B$, $\alpha$ and $\beta$. The upper index $(p)$ denotes the
iteration step. The momentum term serves to damp possible oscillations and to speed up the
convergence in regions where $E$ is almost flat [?].
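The ingredients of this section (network formula, quadratic error function and gradient descent with a momentum term) can be sketched in a few lines of Python. This is a minimal illustration and not the author's implementation: the logistic form of `g`, the placement of the temperature `t` in both layers, the averaging of `E` over the patterns, the numerical gradient and all function names are assumptions made for this sketch.

```python
import math

def g(z):
    # Logistic sigmoid, assumed here; note g(0) = 0.5.
    return 1.0 / (1.0 + math.exp(-z))

def forward(X, A, B, alpha, beta, t=1.0):
    # Hidden layer: one sigmoid per hyperplane B_j . X - alpha_j.
    hidden = [g((sum(b * x for b, x in zip(B[j], X)) - alpha[j]) / t)
              for j in range(len(B))]
    # Output layer: one sigmoid per class.
    return [g((sum(a * h for a, h in zip(A[i], hidden)) - beta[i]) / t)
            for i in range(len(A))]

def error(patterns, targets, A, B, alpha, beta, t=1.0):
    # Quadratic cost, averaged over all N patterns (batch mode).
    total = 0.0
    for X, O in zip(patterns, targets):
        Y = forward(X, A, B, alpha, beta, t)
        total += 0.5 * sum((y - o) ** 2 for y, o in zip(Y, O))
    return total / len(patterns)

def numerical_gradient(cost, w, h=1e-6):
    # Central differences: slow, but avoids hand-derived backpropagation formulas.
    grad = []
    for idx in range(len(w)):
        up = w[:idx] + [w[idx] + h] + w[idx + 1:]
        down = w[:idx] + [w[idx] - h] + w[idx + 1:]
        grad.append((cost(up) - cost(down)) / (2.0 * h))
    return grad

def descend(cost, w, eta=0.5, kappa=0.5, steps=300):
    # Gradient descent with step size eta and a momentum term of weight kappa.
    delta = [0.0] * len(w)
    for _ in range(steps):
        grad = numerical_gradient(cost, w)
        delta = [-eta * gr + kappa * d for gr, d in zip(grad, delta)]
        w = [wi + d for wi, d in zip(w, delta)]
    return w

# Toy problem with K = J = I = 1; w packs (A, B, alpha, beta).
patterns = [[-1.0], [1.0]]
targets = [[0.0], [1.0]]

def toy_cost(w):
    return error(patterns, targets, [[w[0]]], [[w[1]]], [w[2]], [w[3]])

w_start = [1.0, 1.0, 0.0, 0.0]
w_end = descend(toy_cost, w_start)
print(toy_cost(w_start), toy_cost(w_end))
```

The numerical gradient keeps the sketch short; a real backpropagation code would compute the same derivatives analytically layer by layer.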



Typically FFBP have many free parameters which need to be adjusted by the minimization
procedure. Due to the non-linearity of the neural net function $f(X)$ it is very likely that local
minima of the error function $E$ will prevent convergence to the global minimum, which
results in a deterioration of the classification ability. One common method to avoid this problem
is to change the definition of $E$: instead of averaging over all $N$ patterns, the sum over $n$ in
equation (3) extends only over $\tilde{N}$ patterns with $1 \le \tilde{N} < N$ (called incremental updating).
This introduces some randomness into the minimization procedure which might help to overcome
local minima, but at the same time introduces a new free parameter $\tilde{N}$ which can only be
optimized by trial and error. For $\tilde{N} = N$ the method is called batch mode updating.
For standard FFBP the temperature $t$ does not affect the performance of the minimization
procedure. Since no constraints are imposed on the entries of the matrices, any change of $t$ can
be compensated by an overall rescaling of the weights.
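The compensation argument can be checked numerically for a single sigmoid unit. The sketch below assumes a logistic sigmoid and a unit of the form $g((w \cdot X - a)/t)$; the names `unit`, `w` and `a` are illustrative only.

```python
import math

def g(z):
    # Logistic sigmoid (assumed form).
    return 1.0 / (1.0 + math.exp(-z))

def unit(weights, offset, X, t):
    # One sigmoid unit: g((w . X - offset) / t).
    return g((sum(w * x for w, x in zip(weights, X)) - offset) / t)

X = [0.3, -1.2, 0.8]
w = [0.5, -0.25, 1.5]
a = 0.1
t = 4.0

# Output at temperature t with the original weights ...
y_hot = unit(w, a, X, t)
# ... equals the output at t = 1 with all weights divided by t.
y_rescaled = unit([wi / t for wi in w], a / t, X, 1.0)
print(y_hot, y_rescaled)
```

As long as the weights are unconstrained, the minimization can therefore absorb any fixed value of `t` into the weights.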



The least squares fit procedure requires several input parameters whose values have to be assumed
and tested, i.e.

• the temperature ($t$),

• the initial range of the weights ($\epsilon$),

• the weight of the momentum term ($\kappa$),

• the step size ($\eta$), and

• the number of patterns over which the sum for $E$ extends ($\tilde{N}$).

The performance of the FFBP is strongly influenced by the choice of these parameters. For
instance, the possibility of getting trapped in a local minimum is enhanced by a wrong choice
of $\epsilon$. Often the surface of $E$ is probed by initializing the matrices of $f(X)$ at different points
in the parameter space and by choosing different sets of values for the parameters listed above.
Each of the networks will achieve a different performance on an independent test sample, which
has not been part of the sample used to minimize the cost function. This usually happens if the
minimization procedure converges to a local minimum. It can be shown that an average over
the output values of the different networks significantly improves the overall performance on
the test sample. This method is referred to as the ensemble method [?]. However, it does not
ensure an optimal solution and the results depend on the number of trials. Another approach
is denoted as weight decay and reduces the number of weights by adding a penalty term to
the cost function. This term depends on the size of the weights and thus gives each weight a
tendency to decay to zero. Thereby it is less likely that the error function exhibits local minima
because it depends on fewer weights [?].
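The decay mechanism can be made concrete with a small sketch. The text only states that the penalty depends on the size of the weights; the quadratic form $(\lambda/2) \sum w^2$ used below, the parameter name `lam` and the function names are assumptions for illustration.

```python
def penalized_cost(cost, w, lam):
    # Original cost plus a quadratic penalty on the size of the weights.
    return cost(w) + 0.5 * lam * sum(wi * wi for wi in w)

def decay_step(grad, w, eta, lam):
    # One plain gradient step on the penalized cost:
    # the penalty contributes lam * w_i to each derivative.
    return [wi - eta * (gi + lam * wi) for wi, gi in zip(w, grad)]

# With a vanishing data gradient every weight shrinks by the factor (1 - eta * lam).
w = [2.0, -4.0]
w_next = decay_step([0.0, 0.0], w, eta=0.1, lam=0.5)
print(w_next)
```

Weights that are not supported by the data gradient are thus driven towards zero, which is the sense in which the effective number of weights is reduced.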



The problem of restricting the parameter space to a region which ensures convergence to the
global minimum thus remains to a large extent unsolved.



3 Modified FFBP model 

Let us assume the entries of the input vector $X$ to be of the order of unity. The matrices $A_{ij}$
and $B_{jk}$ can be normalized for each row, i.e.

$$ \sum_{j=1}^{J} A_{ij}^2 = 1 , \;\; \forall i \; ; \qquad \sum_{k=1}^{K} B_{jk}^2 = 1 , \;\; \forall j . \tag{6} $$

Thereby $B_{jk}$ and $\alpha_j$ denote the normal and the Euclidean distance to the origin of the $j$-th hyperplane
in the space of the input variables, as depicted in figure 1. The hyperplanes are defined by the
equations $B_{jk} X_k - \alpha_j = 0$. The same is true for the hyperplanes in the space of the hidden
variables, with the replacements $B_{jk} \to A_{ij}$, $\alpha_j \to \beta_i$ and $X_k \to g_j$. These constraints
remove the dependence of the minimization procedure on $\epsilon$. The weights $\alpha_j$ and $\beta_i$ are initially
set to zero such that only the orientation of the hyperplanes varies. Due to this modification the
role of $t$ becomes a major one and rules the overall structure of the error function. For $t \to \infty$
the values of $Y_i$ are all equal to 0.5 for any finite $X$. The value of $E$ will thus converge to a
constant, i.e.

In the limit $t \to 0$ the sigmoid function $g$ becomes the step function $\Theta$. This results in
probabilities $Y$ equal to 1 or 0, thus in non-overlapping distributions in the input space, i.e. the
input distributions are completely separated. In terms of the parameter $t$, the error function
$E$ acquires a well defined structure. The contribution to $E$ of patterns belonging to class $C_l$ is
determined by the following expression:



$$ E(C_l) = \int dX \; P(C_l) \, P(X|C_l) \, \underbrace{\frac{1}{2} \sum_{i=1}^{I} \left( Y_i(X(C_l)) - O_i(C_l) \right)^2}_{\Lambda(C_l)} \tag{8} $$

If an input vector $X$ belongs to class $C_l$, the function $P(X|C_l)$ denotes the probability
distribution of $X$ and $P(C_l)$ the probability of class $C_l$. Both functions depend only on the problem
under investigation, thus $\Lambda(C_l)$ is the only part of equation (8) which changes as a function of
the parameter $t$. To analyse the high-$t$ behavior of $E(C_l)$ one can expand $\Lambda(C_l)$ in $1/t$.

For small $1/t$, when higher-order terms in $1/t$ can be neglected, the error function
becomes a quadratic sum of the weights with only one minimum. In the next section this will be
illustrated with an example.
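The row-wise normalization of equation (6) and the high-temperature behavior of a sigmoid unit can be illustrated as follows. This is a sketch under the assumption of a logistic sigmoid; the helper names `normalize_rows` and `signed_distance` are hypothetical.

```python
import math

def normalize_rows(M):
    # Scale each row to unit Euclidean length; assumes no row is all zeros.
    out = []
    for row in M:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm for v in row])
    return out

def g(z):
    # Logistic sigmoid (assumed form), with g(0) = 0.5.
    return 1.0 / (1.0 + math.exp(-z))

def signed_distance(B_row, alpha_j, X):
    # For a unit normal B_row this is the signed distance of X
    # to the hyperplane B_row . X - alpha_j = 0.
    return sum(b * x for b, x in zip(B_row, X)) - alpha_j

B = normalize_rows([[3.0, 4.0], [1.0, -1.0]])
X = [2.0, 1.0]

# As t grows, every hidden output approaches 0.5 for finite X.
for t in (1.0, 10.0, 1000.0):
    print(t, [g(signed_distance(row, 0.0, X) / t) for row in B])
```

With normalized rows the argument of the sigmoid is a geometric distance measured in units of `t`, which is why the temperature alone controls how sharp the separation is.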



Thus, while for high temperatures the error function has a smooth behavior, at low temperatures
all its structures are present. This transition is continuous and it is reasonable to assume that
the global minimum of the error function becomes, at high temperatures, the only minimum of
$E$. The idea is then to start the minimization at high values of $t$ and to converge to the region
of the minimum of $E$ in this regime. The resulting weights are expected to be already close to
those corresponding to the global minimum. If so, a further decrease of the temperature should
lead to the global minimum without the risk of being trapped in a local minimum. Therefore
the summation in equation (3) must extend over the whole pattern sample to determine the
position of the global minimum as precisely as possible, i.e. the free parameter $\tilde{N}$ is set to $N$.
In the minimization procedure the temperature is changed in the same way as the weights, but
the step size for the temperature is reduced by one order of magnitude. This ensures a faster
convergence behavior for the weights, therefore



$$ t^{(p+1)} = t^{(p)} - \frac{\eta}{10} \, \Delta t^{(p)} \tag{10} $$

In the low-$t$ region $E$ becomes very steep, which might lead to oscillations in $t$ during the
minimization. Any step in the wrong direction could make the error function yield huge derivatives
with respect to $t$. Thus, given that the functional dependence of $E$ on $t$ is of the form
$E \sim \tanh^2(1/t)$, the step size $\eta$ must be $t$-dependent, i.e.

$$ \eta(t) = 1 + \gamma - \tanh^2 \frac{1}{t} , \tag{12} $$

where the new parameter $\gamma$ is the pedestal value for $\eta(t)$. A value of $\gamma = 0.1$ ensures that only
about 10% of the calculated derivatives will contribute to the change of the weights for $t < 0.5$.
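The temperature-dependent step size of equation (12) is easy to tabulate. The function name `step_size` is illustrative; the default `gamma=0.1` is the value quoted in the text.

```python
import math

def step_size(t, gamma=0.1):
    # eta(t) = 1 + gamma - tanh^2(1/t): close to 1 + gamma at high t,
    # falling towards the pedestal gamma as t approaches zero.
    return 1.0 + gamma - math.tanh(1.0 / t) ** 2

for t in (10.0, 5.0, 1.0, 0.5, 0.1):
    print(t, step_size(t))
```

At high temperatures the step size is essentially constant, while in the steep low-$t$ region it is damped towards `gamma`, which limits the oscillations described above.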



3.1 Determination of $t^{(0)}$

Usually one aims to separate two distinct distributions. In that case the neural net formula (2)
can be simplified. We set $I = 1$, change the sigmoid function to $g(\ldots) = \tanh(\ldots)$ and assign
two output values to the now scalar variable $O(C_l)$:

$$ O(C_l) = \begin{cases} +1 , & \text{if } X \text{ belongs to class } C_1 \\ -1 , & \text{if } X \text{ belongs to class } C_2 \end{cases} $$

Therefore $Y$ is a scalar and becomes an estimator for the probability function

$$ Y \approx P(C_1|X) - P(C_2|X) , \tag{13} $$

with $P(C_l|X)$ being the a posteriori Bayesian probability that $X$ belongs to the class $C_l$.
For the case of two overlapping Gaussian distributions $g(\ldots)$ represents the exact solution for
the Bayesian probabilities. If we consider a one-dimensional problem ($K = 1$) and allow for
simplicity just one cut ($J = 1$), formula (2) can be reduced to:

$$ Y(X(C_l)) = \tanh\left( \frac{X(C_l) - d}{t} \right) , \tag{14} $$
with $d$ as the only remaining weight, corresponding to $\alpha_j$ in equation (2). Let us assume a flat
conditional probability distribution for the different classes, i.e.

$$ P(X|C_l)_{ij} = \begin{cases} \dfrac{1}{X_j - X_i} , & \text{if } X_i < X < X_j \\ 0 , & \text{else} \end{cases} \tag{15} $$

The contribution of $P(X|C_l)_{ij}$ to the error function $E$ can be evaluated analytically:
For high values of $t$ the non-linear functions in equation (16) are expanded in powers of $(X - d)/t$.



Neglecting terms of the order of 0( 
which might lead to a local minimum. Its partial derivative with respect to $d$ turns out to be only
quadratic:
This already proves that $E$ has at most one maximum and one minimum. The quadratic
dependence of the partial derivative $\partial_d E$ on the weight $d$ vanishes for $P(C_l) = 0.5$ due to
the definition of $O(C_l)$. With this additional requirement the error function $E$ depends only
quadratically on the weight $d$. Thereby the optimal a priori probability for the minimization
procedure is determined to be 0.5. As $Y$ will be an estimator for an a posteriori probability, it
is possible afterwards to reweight the result to any a priori probability under investigation. The
$t$-dependence of $E_{ij}(C_l)$ is determined by a sum of two terms
which are depicted in figure 2. The two terms exhibit a different behavior for $t \to 0$ and
$t \to \infty$. The first term (I) dominates in the low-$t$ region and therefore determines the structure
of the error function $E$ in this temperature range. For high values of $t$ the second term (II)
dominates over the first one. Both terms have almost equal weight for $1/t \approx 1.5$. Since the
proposed method of minimizing the error function requires that $t^{(0)}$ initially be chosen such
that $E$ exhibits only one minimum, it implies $1/t^{(0)} < 1.5$. At $t = 5$ the second term (II) has
a 10 times bigger weight than the first term (I), which should satisfy the requirements necessary
for the approximation made above. With the assumption that the values of the input
vector $X$ are of order one, the general prescription for the initial value of the temperature $t$ is
thus

$$ t^{(0)} > 5 . \tag{20} $$
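The expansion argument of this section can be retraced for the reduced formula $Y = \tanh((X - d)/t)$. The following is a sketch under the assumption that the cost per pattern is $\frac{1}{2}(Y - O)^2$ with $O = \pm 1$ and that the class densities are normalized; it is not a reproduction of the original equations (16) to (18).

```latex
% u = (X - d)/t, and O = \pm 1 so that O^2 = 1
\tanh u = u - \tfrac{1}{3} u^3 + \mathcal{O}(u^5)
\quad\Longrightarrow\quad
\tfrac{1}{2}\left( \tanh u - O \right)^2
  = \tfrac{1}{2} - O\,u + \tfrac{1}{2} u^2 + \tfrac{1}{3}\, O\, u^3 + \mathcal{O}(u^4)

% Averaging over both classes with a priori probabilities P(C_l):
E(d) \approx \sum_{l=1}^{2} P(C_l) \int dX \, P(X|C_l)
  \left[ \tfrac{1}{2} - O(C_l)\,u + \tfrac{1}{2} u^2 + \tfrac{1}{3}\, O(C_l)\, u^3 \right],
\qquad u = \frac{X - d}{t}

% The only d^3 contribution comes from the u^3 term:
E(d) \supset -\frac{d^3}{3 t^3} \left[ P(C_1) - P(C_2) \right]
```

To this order $E$ is a cubic polynomial in $d$, so its derivative with respect to $d$ is quadratic, and the cubic coefficient cancels for $P(C_1) = P(C_2) = 0.5$, leaving a purely quadratic dependence on $d$.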

The prescription (20) will be illustrated by a numerical example. Suppose one aims to separate two
one-dimensional overlapping distributions with flat conditional probabilities as defined in equation
(15) and assumes that

$$ P(X|C_1) = 0.7 \, P(X|C_1)_{12} + 0.3 \, P(X|C_1)_{34} , \qquad P(X|C_2) = 1.0 \, P(X|C_2)_{56} $$

with

$$ X_1 = -4.0 , \quad X_2 = 0.5 , \quad X_3 = 3.0 , \quad X_4 = 3.1 , \quad X_5 = -0.5 , \quad X_6 = 4.0 . $$

With $P(C_1) = P(C_2) = 0.5$ one gets a surface of the error function $E$ as depicted in figure 3.
If the minimization procedure were to start at any value of $d$ and $t^{(0)} > 5$, it would converge
to the global minimum of $E$ without the possibility of getting trapped in a local minimum.
In the case of non-overlapping distributions the temperature $t$ will converge to zero to model
probabilities equal to one. If the distributions overlap by 100% the temperature will converge
to infinity. Thereby the final value of $t = t^{(\infty)}$ becomes a measure of the overlap of the two
distinct distributions, i.e.

£(oo) 

Overlap « ^ . (21) 
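The error surface of the numerical example can be explored with a short script. This is a sketch, not the author's calculation: it integrates numerically with a midpoint rule instead of using the analytic expression, assumes a factor 1/2 in the quadratic cost, and introduces a hypothetical `sign` parameter for the orientation of the one-dimensional normal ($B = \pm 1$), which the reduced formula (14) does not show explicitly.

```python
import math

# Flat class-conditional densities of the example: (weight, lower bound, upper bound).
CLASS_1 = [(0.7, -4.0, 0.5), (0.3, 3.0, 3.1)]   # target O = +1
CLASS_2 = [(1.0, -0.5, 4.0)]                    # target O = -1
PRIOR = 0.5                                     # P(C_1) = P(C_2)

def mean_sq_dev(lo, hi, target, d, t, sign=1.0, n=200):
    # Midpoint-rule average of (tanh(sign * (X - d) / t) - target)^2 over [lo, hi].
    width = (hi - lo) / n
    total = 0.0
    for m in range(n):
        x = lo + (m + 0.5) * width
        total += (math.tanh(sign * (x - d) / t) - target) ** 2
    return total / n

def error(d, t, sign=1.0):
    # Error function E(d, t) summed over both classes.
    e = 0.0
    for weight, lo, hi in CLASS_1:
        e += PRIOR * weight * 0.5 * mean_sq_dev(lo, hi, +1.0, d, t, sign)
    for weight, lo, hi in CLASS_2:
        e += PRIOR * weight * 0.5 * mean_sq_dev(lo, hi, -1.0, d, t, sign)
    return e

def count_minima(t, sign=1.0, d_lo=-6.0, d_hi=6.0, n=121):
    # Count strict interior local minima of E(d, t) on a grid in d.
    ds = [d_lo + i * (d_hi - d_lo) / (n - 1) for i in range(n)]
    es = [error(d, t, sign) for d in ds]
    return sum(1 for i in range(1, n - 1)
               if es[i] < es[i - 1] and es[i] < es[i + 1])

for t in (0.2, 1.0, 5.0):
    print(t, count_minima(t, sign=-1.0))
```

Scanning `count_minima` over a range of temperatures gives a quick, grid-limited impression of where additional minima in $d$ appear; the result depends on the grid and on the orientation chosen through `sign`.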

Thus to summarize, the proposed modified neural network differs from networks using standard 
backpropagation as follows : 



| Parameter | Name | New model | Standard FFBP |
| --- | --- | --- | --- |
| $t^{(0)}$ | The initial value of the temperature | $t^{(0)} > 5$. The temperature is not constant and changes for each iteration step. | Not well defined. Usually $t$ is not changed throughout the minimization procedure and thus set to $t = 1$. |
| $\epsilon$ | The absolute range of the initial weights | Cancelled, as the weights $A$ and $B$ are row-wise normalized to one, and $\alpha$ and $\beta$ are initially set to zero | Not well defined. Usually $\epsilon < 0.01$ |
| $\kappa$ | The weight of the momentum term | Not well defined | Not well defined |
| $\eta$ | The step size | $\eta(t) = 1.0 + \gamma - \tanh^2(1/t)$ | Not well defined. Usually $\eta < 0.001$ |
| $\gamma$ | The pedestal for $\eta$ | Not well defined. Usually set to 0.1 | Does not exist in this model |
| $\tilde{N}$ | The number of patterns to sum over in $E$ | $\tilde{N} = N$ | Not well defined. Usually set to $\tilde{N} \approx 10$ |
| $P(C_l)$ | The a priori probability of class $C_l$ | For the case of two classes determined to be $P(C_l) = 0.5$, otherwise not well defined | Not well defined |
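The reweighting to another a priori probability mentioned in section 3.1 follows from Bayes' theorem. The sketch below assumes a network trained with $P(C_1) = P(C_2) = 0.5$ whose scalar output estimates $Y \approx P(C_1|X) - P(C_2|X)$; the function names are illustrative.

```python
def posterior_from_output(Y):
    # With P(C_1|X) + P(C_2|X) = 1, the output Y = P(C_1|X) - P(C_2|X)
    # gives P(C_1|X) = (1 + Y) / 2.
    return 0.5 * (1.0 + Y)

def reweight(p1, new_prior):
    # p1 is the posterior for class C_1 obtained with equal priors, so
    # p1 / (1 - p1) is the likelihood ratio. Apply the new prior for C_1.
    num = new_prior * p1
    return num / (num + (1.0 - new_prior) * (1.0 - p1))

p1 = posterior_from_output(0.6)
print(p1, reweight(p1, 0.1))
```

Training at equal priors and reweighting afterwards keeps the minimization in the regime where the error function is quadratic in the weights at high temperature.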



4 Conclusions 

A novel method to minimize the quadratic cost function of a neural network with error
backpropagation has been presented. The essential modification is the row-wise normalization of
the matrices $A$ and $B$ which represent part of the weights of the neural network. Thus, the
entries of each row of $A$ and $B$ acquire the meaning of normals which define the orientation of
hyperplanes. Due to the normalization, the error function $E$ obtains a well defined structure
as a function of the free parameter $t$, denoted as the temperature. It has been proven that for
high values of $t$, when higher-order terms in $1/t$ can be neglected, the cost function $E$
always exhibits a quadratic dependence on the weights and thus only one minimum. For low
temperatures all structures of $E$ are apparent and local minima might exist. This transition is
continuous and it is natural to assume that the single minimum of the cost function at high
values of $t$ leads to the global minimum of $E$ at low temperatures. However, there is as yet no
rigorous proof that this is the case.

The minimization procedure starts at high temperatures and converges first to the single
minimum of the error function in this range of $t$. A further decrease of $t$ should lead to the global
minimum without the risk of converging to a local minimum. Similar to the weights, the
temperature becomes a parameter whose value is determined by the minimization procedure.
Assuming the entries of the input vector $X$ to be of the order of one, the initial value of $t$ should be
$t^{(0)} > 5$, as derived from a one-dimensional example. Multi-dimensional problems are nested
superpositions of the one-dimensional example, therefore this range for $t^{(0)}$ should still ensure
the quadratic dependence of the cost function on the weights. Several free parameters of the
standard minimization procedure for FFBP, whose values have to be assumed and tested, are
constrained. Thus, without any fine tuning, the new model is applicable to any classification
problem.

The new method described in this paper has been successfully applied to the problem of electron
identification [?] in the ZEUS detector [?] at HERA. At HERA electrons of 30 GeV collide with
protons of 820 GeV. Highly energetic particles and jets of particles, mainly hadronic particles,
emerge from the interaction point and deposit energy in the spatially segmented
uranium-scintillator calorimeter (CAL) of the ZEUS detector. A certain fraction of the interactions
between electrons and protons are characterized by the presence in the final state of the electron
scattered under a large angle, which is thus in the geometrical acceptance of the CAL. The
aim is to select this type of event, which is believed to originate from the scattering of the
electron on a point-like constituent of the proton. The showering properties of electrons and
hadrons in an absorber material are different and it is possible to identify the particle type by
the pattern of the energy deposits in the CAL. The longitudinal and transverse segmentation
of the CAL provides 54 values reflecting the spatial shape of the shower. The shape also depends
on the angle of incidence of the showering particle. After including this angle the input
patterns are 55-dimensional. Using the new method, a neural network has been trained on
patterns originating from electrons and hadrons. In comparison with a classical approach, the
neural network separates the distinct distributions better, giving a typical increase of about
10% in efficiency and purity. A principal component analysis has shown that this improvement
is achieved through the use of all the 55 variables.
5 Appendix



This new minimization procedure has been implemented in a FORTRAN program called dEXTRa.
The program is controlled by a configuration file named dextra.cnf where all parameters,
paths and options are set. It must be located in the directory from which dEXTRa is started.
The value of the error function E for t — for both the training and an independent test 
pattern sample can be calculated during the minimization procedure. This method is denoted
as cross-validation and serves to check for over-fitting, i.e. when the network starts to pick up
fluctuations from the training sample [?]. After the training procedure the final set of matrices
and parameters can be written to a file, which can afterwards be read in again for application
to new pattern samples.

The program is available on request from the author. For further questions please contact
sinkus@zow.desy.de.



Figure 1: Graphical representation of the first transformation of the neural net formula (2).
The matrix operation $B_{jk} X_k - \alpha_j$ calculates the Euclidean distances of the input vector $X$ to
the $J$ hyperplanes in the space of the input variables (Hesse's normal form). An example for
the first hyperplane ($j = 1$) in 3 dimensions ($K = 3$) is depicted. Each row of the matrix $B_{jk}$
denotes a normal of a hyperplane in the space of the input variables. The value $\alpha_j$ determines
the distance of the $j$-th hyperplane to the origin.
Figure 2: The $t$-dependence of the individual terms contributing to the error function $E_{ij}(C_l)$
for a one-dimensional distribution with a flat conditional probability.




Figure 3: Structure of the error function $E$ for the one-dimensional example (described in the
text) as a function of the weight $d$ and the temperature $t$. The global minimum of this specific
error function is located at $d \approx 0.5$ and $t \approx 4.5$. Local minima of $E$ lie below $t \approx 1.3$.